In [1]:
using PyPlot
In [2]:
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = zeros(N*K,D) # data matrix (each row = single example)
y = zeros(Int64, N*K) # class labels
for j in 1:K
idx = (1+N*(j-1)):(N*j); # row indices for class j in X and y
r = linspace(0.0,1,N); # radius
t = linspace((j-1)*4,(j)*4,N) + randn(N)*0.2 # theta
X[idx,:] = [r.*sin(t) r.*cos(t)]
y[idx] = j;
end
In [3]:
# let's visualize the data:
scatter(X[:, 1], X[:, 2], s=40, c=y, alpha=0.5)
Out[3]:
In [4]:
# initialize parameters randomly
W = 0.01 * randn(D,K)
b = zeros((1,K))
Out[4]:
In [5]:
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
Out[5]:
In [6]:
num_examples = size(X,1)
Out[6]:
Normalizing the exponentiated class scores gives an array probs of size [300 x 3], where each row now contains the class probabilities. In particular, since we've normalized them, every row now sums to one. We can now query for the log probabilities assigned to the correct class in each example:
corect_logprobs = [-log(probs[i, y[i]]) for i in 1:num_examples]
The array corect_logprobs is a 1D array holding just the negative log probabilities assigned to the correct class of each example. The full loss is then the average of these values plus the regularization loss:
# compute the loss: average cross-entropy loss and regularization
data_loss = sum(corect_logprobs)/num_examples
reg_loss = 0.5*reg*sum(W.^2)
loss = data_loss + reg_loss
In this example let's use the cross-entropy loss that is associated with the Softmax classifier. Recall that if $f$ is the array of class scores for a single example (e.g. an array of 3 numbers here), then the Softmax classifier computes the loss for that example as:
$$ L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) $$We can see that the Softmax classifier interprets every element of $f$ as holding the (unnormalized) log probabilities of the three classes. We exponentiate these to get (unnormalized) probabilities, and then normalize them to get probabilities. Therefore, the expression inside the log is the normalized probability of the correct class. Note how this expression works: this quantity is always between 0 and 1. When the probability of the correct class is very small (near 0), the loss goes towards (positive) infinity. Conversely, when the correct class probability goes towards 1, the loss goes towards zero because $\log(1) = 0$. Hence, the expression for $L_i$ is low when the correct class probability is high, and very high when it is low.
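To make this concrete, here is a minimal Julia sketch of the per-example loss; the score vector f and the correct class index yi below are made-up values, used only for illustration:
f = [1.0, -2.0, 0.5] # hypothetical class scores for one example
yi = 1 # assume the correct class is class 1
expf = [exp(s) for s in f] # unnormalized probabilities
p = expf / sum(expf) # normalized class probabilities
Li = -log(p[yi]) # cross-entropy loss for this example
The larger p[yi] is, the smaller Li becomes, exactly as described above.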
Recall also that the full Softmax classifier loss is then defined as the average cross-entropy loss over the training examples and the regularization:
$$L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \frac{1}{2} \lambda \sum_k\sum_l W_{k,l}^2 }_\text{regularization loss}$$Writing $p$ for the vector of normalized class probabilities, the loss for one example is:
$$p_k = \frac{e^{f_k}}{ \sum_j e^{f_j} } \hspace{1in} L_i =-\log\left(p_{y_i}\right)$$We now wish to understand how the computed scores inside $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $\partial L_i / \partial f_k$. The loss $L_i$ is computed from $p$, which in turn depends on $f$. It's a fun exercise for the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:
$$\frac{\partial L_i }{ \partial f_k } = p_k - 1(y_i = k)$$Notice how elegant and simple this expression is. Suppose the probabilities we computed were p = [0.2, 0.3, 0.5], and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be df = [0.2, -0.7, 0.5]. Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector $f$ (the scores of the incorrect classes) leads to an increased loss (due to the positive signs +0.2 and +0.5), and increasing the loss is bad, as expected. However, increasing the score of the correct class has a negative influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss $L_i$, which makes sense.
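The same numeric example can be reproduced in a couple of lines of Julia (the values of p and yi are the illustrative ones from the text above):
p = [0.2, 0.3, 0.5] # computed class probabilities for one example
yi = 2 # the correct class is the middle one
df = copy(p) # gradient dLi/df starts out equal to the probabilities ...
df[yi] -= 1 # ... minus 1 at the correct class, giving [0.2, -0.7, 0.5]
This is exactly the step the gradient descent loop below performs on dscores: it starts from probs and subtracts 1 from the entry of the correct class in every row.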
In [7]:
# gradient descent loop
for i in 1:200
# compute class scores for a linear classifier
scores = X*W .+ b;
# compute unnormalized class probabilities
exp_scores = exp(scores)
# normalize them for each example
probs = exp_scores./sum(exp_scores,2)
corect_logprobs = zeros(num_examples)
for j in 1:num_examples
corect_logprobs[j] = -log(probs[j,y[j]]);
end
# compute the loss/cost: average cross-entropy loss and regularization
data_loss = sum(corect_logprobs)/(num_examples)
reg_loss = 0.5*reg*sum(W.^2)
loss = data_loss + reg_loss
if i % 10 == 0
println("iteration: ", i, " loss: ", loss);
end
# compute the gradient on scores
dscores = probs
for j in 1:num_examples
dscores[j, y[j]] = dscores[j,y[j]]-1;
end
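# average the gradient over all examples, matching the 1/N in the data loss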
dscores /= num_examples
# backpropagate the gradient to the parameters (W,b)
dW = X'*dscores
db = sum(dscores, 1)
dW += reg*W # regularized gradient
# perform a parameter update
W += -step_size * dW
b += -step_size * db
end
In [8]:
# evaluate training set accuracy
scores = X*W .+ b
predicted_class = zeros(size(scores, 1))
for i in 1:size(scores,1)
predicted_class[i] = indmax(scores[i,:])
end
correct = 0;
for i in 1:length(y)
if y[i] == predicted_class[i]
correct = correct + 1;
end
end
println("training accuracy: ", correct/length(y))
In [9]:
# plot the resulting classifier
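# build a regular grid over the input space, evaluate the classifier at every grid point,
# and reshape the predicted classes for contour plotting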
h = 0.02;
x_min = minimum(X[:, 1]) - 1;
x_max = maximum(X[:, 1]) + 1;
y_min = minimum(X[:, 2]) - 1;
y_max = maximum(X[:, 2]) + 1;
numX = convert(Int, floor((x_max - x_min)/h));
xx = zeros(numX);
xx[1] = x_min;
yy = zeros(numX);
yy[1] = y_min;
for i in 2:numX
xx[i] = xx[i-1] + h;
yy[i] = yy[i-1] + h;
end
grid_x = [i for i in xx, j in yy];
grid_y = [j for i in xx, j in yy];
xy = [grid_x[:] grid_y[:]];
z = xy*W .+ b;
zz = zeros(size(z,1));
for i in 1:size(z,1)
zz[i] = indmax(z[i,:])
end
zz = reshape(zz, size(grid_x));
In [10]:
contourf(grid_x, grid_y, zz, cmap=get_cmap("Spectral"), alpha=0.8)
scatter(X[:, 1], X[:, 2], c=y, s=40)
Out[10]: